{% extends 'base.html' %} {% block head %} {% endblock %} {% block content %}
From a citizen and pedestrian perspective: We want a safe journey in Melbourne. Which intersections are safest and which are the riskiest from a road safety perspective? Where are accident hot-spots occurring, and under what circumstances?
From a council perspective: As a council we want to invest in road safety initiatives which can effectively reduce serious injuries and fatalities. Are the current approaches to road network design having the impact we expected?
This use case extends the Melbourne Bicycle Network Route & Road Safety analysis created in Trimester 1, 2022. We can utilise the VicRoads traffic accident data and aggregate this with the City of Melbourne's open pedestrian network dataset.
Using the power of data aggregation, we can combine Melbourne Open Data datasets, such as transport networks and events, with open government datasets, including traffic accident 'crash stats' from Victoria Police and traffic event data from VicRoads, and begin to observe, analyse and report on geographical patterns between these datasets.
We can ask questions such as:
Goals for exploratory data analysis:
This use case and exploratory data analysis project can support the City of Melbourne in the following ways:
Support for the ‘Safety and Well-being’ strategic vision and goals
Influence the creation of a ‘key risk indicator’ to monitor progress on the reduction of the 'Number of transport-related injuries and fatalities’ on Melbourne roads
Support further discussion between City of Melbourne and Victorian Road Safety partner agencies to improve road network design and infrastructure programs
To cite some key pedestrian road safety statistics, sourced from the Transport Accident Commission:
In the last five years, 175 pedestrians have been killed on Victorian roads, and many more have been injured or seriously injured. Pedestrians make up around 15% of the total number of road deaths each year.
The approach to aggregating key data sources and analysing geographical attributes is currently used by the TAC (Transport Accident Commission) in Victoria when analysing accident hot-spots and reviewing whether the design of the road could be improved to reduce road trauma.
This type of analysis was used by TAC in recent years to assess fatal accident hotspots in Geelong.
The TAC, in partnership with the Victorian Road Safety partner agencies, discovered a cluster of fatal accidents occurring over a 5-year period along a specific stretch of Thompsons Road, North Geelong.
The analysis informed a strategic decision for road safety partners (Victoria Police, VicRoads, City of Greater Geelong, TAC) to re-design the road to make it safer.
The road re-design has resulted in a substantial reduction in road trauma along Thompsons Road in North Geelong.
A similar analysis technique and approach could be applied to the City of Melbourne road network.
REFERENCE:
Document the data considerations and risk assessments
Prepare the Traffic Accident 'crash-stats' source data (this is handled by a separate python notebook)
Access and read-in the Melbourne Pedestrian Network dataset via the SOCRATA API
Explore the Melbourne Pedestrian Network dataset as a geoJSON file
Read-in the pre-processed Traffic Accident 'crash-stats' dataset
Explore the Traffic Accident 'crash-stats' dataset
Visualise the geographical features of the Melbourne Pedestrian Network overlayed with Traffic Accident 'crash-stats' dataset
Dataset list:
1. Information Security and Sensitivity
The analysis datasets contain only de-identified data; no personally identifiable names or contact details are used or included.
2. Converting Raw Traffic Accident 'Crash-Stats' Data into a Useful Dataset
After initial observation of the traffic accident data in its raw form, the raw data was prepared and converted into a working ‘.csv’ file and imported into this notebook for further analysis.
The following process was used for converting the raw data into a working dataset:
The accident context domains 'person', 'accident' and 'node' were used to form the foundation of the working dataset
A series of two inner merges were then performed to construct the working dataset
To obtain additional traffic accident descriptive features, five additional data domains were left joined in sequence
Variable naming conventions were applied
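The merge sequence above can be sketched with pandas. The frames below are toy stand-ins for illustration only; the real domain tables are larger, though the join key 'KEYAccidentNumber' follows the working dataset's actual naming.

```python
import pandas as pd

# Toy stand-ins for the accident context domain tables (illustrative only)
accident = pd.DataFrame({'KEYAccidentNumber': ['A1', 'A2'],
                         'CATDayOfWeek_accident': ['Monday', 'Friday']})
person = pd.DataFrame({'KEYAccidentNumber': ['A1', 'A1', 'A2'],
                       'KEYPersonID_person': [1, 2, 3]})
node = pd.DataFrame({'KEYAccidentNumber': ['A1', 'A2'],
                     'NUMLatitude_node': [-37.81, -37.82]})
surface = pd.DataFrame({'KEYAccidentNumber': ['A1'],
                        'CATSurfaceConditionDesc_surface': ['Dry']})

# Two inner merges form the foundation: accident + person + node
wrk = (accident.merge(person, on='KEYAccidentNumber', how='inner')
               .merge(node, on='KEYAccidentNumber', how='inner'))

# Additional descriptive domains are then left-joined in sequence
wrk = wrk.merge(surface, on='KEYAccidentNumber', how='left')
print(wrk.shape)
```

The inner merges yield one row per person involved in each accident, while the left joins preserve every accident row even when a descriptive domain has no matching record.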
Variable features which were created in the working dataset use a three-letter acronym prefix to denote the expected general data type values:
A suffix beginning with an underscore was also used to denote the context data domain each feature originated from. For example, "_person" denotes a variable which originated from the accident person domain dataset.
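As a quick illustration of this convention, a feature name can be split into its type prefix and domain suffix. The helper below is hypothetical (not part of the source notebook) and splits at the last underscore so multi-part stems survive intact:

```python
def parse_feature_name(name: str):
    """Split a working-dataset feature name into (type_prefix, stem, domain)."""
    prefix = name[:3]                           # three-letter type acronym, e.g. CAT, NUM, TIM
    stem, _, domain = name[3:].rpartition('_')  # suffix after the last '_' is the origin domain
    return prefix, stem, domain

print(parse_feature_name('CATRoadUserTypeDesc_person'))
print(parse_feature_name('NUMLatitude_node'))
```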
Manual data inspection notes:
After creating the working dataset, data opportunities were discovered to create new variables to assist with the analysis:
3. Data Cleaning and Pre-processing
Excess whitespace characters were detected in the variables 'TIMAccidentTime_accident' and 'CATDCADesc_accident'; these were removed.
4. Geographical Location Data
To answer queries on geographical accident locations, the analysis dataset requires longitude and latitude data to drive geographical mapping tools and visualisations. The longitude and latitude are captured when accident records are entered into the source system.
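Because the mapping tools depend on these coordinates, a simple sanity check can confirm that values fall within plausible bounds. This is a sketch using toy records in place of the working dataset's 'NUMLatitude_node'/'NUMLongitude_node' columns, and the bounding box around greater Melbourne is an assumption for illustration:

```python
import pandas as pd

# Toy records standing in for the working dataset's coordinate columns
coords = pd.DataFrame({
    'NUMLatitude_node':  [-37.8136, -37.8452, -37.7990],
    'NUMLongitude_node': [144.9631, 144.9825, 144.9460],
})

# Rough bounding box around greater Melbourne (assumed for illustration)
valid = (coords['NUMLatitude_node'].between(-38.5, -37.0)
         & coords['NUMLongitude_node'].between(144.0, 146.0))
print(valid.all())
```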
5. Additional Data
None identified.
6. Data Integrity Checks and Filtering
To begin the analysis we first import the necessary libraries to support our exploratory data analysis using Melbourne Open data.
The following are core packages required for this exercise:
###################################################################
# Libraries used for this use case and exploratory data analysis
###################################################################
import os
import time
import sys
sys.path.insert(1, '../') # so that we can import d2i_tools from the parent folder.
from d2i_tools2 import *
import warnings
warnings.simplefilter("ignore")
from datetime import datetime, date
import numpy as np
import pandas as pd
import geopandas
from sodapy import Socrata
import json
import plotly.express as px
import folium
from folium.plugins import MarkerCluster
import seaborn as sns
import matplotlib.pyplot as plt
To connect to the Melbourne Open Data Portal we establish a connection using the sodapy library, specifying a domain (the website domain where the data is hosted) and an application access token, which can be requested by registering on the City of Melbourne Open Data Portal.
For this exercise we will access the domain without an application token.
########################################################
# Accessing the Melbourne City Pedestrian Network Dataset
########################################################
# Hyperlink to the dataset: https://data.melbourne.vic.gov.au/Transport/Pedestrian-Network/4id4-tydi
dataset_id = '4id4-tydi' #Melbourne City Pedestrian Network dataset
apptoken = os.environ.get("SODAPY_APPTOKEN") # Anonymous app token
domain = "data.melbourne.vic.gov.au"
client = Socrata(domain, apptoken) # Open Dataset connection
Next, we will look at the Pedestrian-Network dataset to better understand its structure and how we can use it.
Our data requirements from this use case include the following:
For this exercise, we start by examining the Pedestrian-Network dataset. Each dataset in the Melbourne Open Data Portal has a unique identifier which can be used to retrieve the dataset using the sodapy library.
The Pedestrian-Network dataset unique identifier is '4id4-tydi'. We will pass this identifier into the sodapy command below to retrieve this data.
This dataset is placed in a Pandas dataframe and we will inspect the metadata.
Working with the Melbourne Pedestrian Network Routes Dataset as a JSON file
The code below shows how to download the Pedestrian Network dataset as a zipped JSON file using the 'requests' library.
import requests
url = 'https://data.melbourne.vic.gov.au/download/4id4-tydi/application%2Fzip'
content = requests.get(url)
# unzip the content
from io import BytesIO
from zipfile import ZipFile
f = ZipFile(BytesIO(content.content))
print(f.namelist())
['Property_centroid.json', 'Pedestrian_network.json']
Working with the Melbourne Pedestrian Network Dataset as a JSON file
The code below describes how to access the Pedestrian Network dataset as a JSON file through a website hyperlink.
#Download the json files and store locally
import zipfile, urllib.request, shutil
url = 'https://data.melbourne.vic.gov.au/download/4id4-tydi/application%2Fzip'
file_name = 'pedestriannetwork.zip'
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
with zipfile.ZipFile(file_name) as zf:
    zf.extractall()
import json
with open('Pedestrian_network.json') as file:
    pedestrianpath = json.load(file)
Accessing the first record in the JSON file
To observe the type of data and values stored within the JSON file we can use the following code to observe the first record.
pedestrianpath["features"][0]
{'type': 'Feature',
'geometry': {'type': 'LineString',
'coordinates': [[144.9825360891, -37.8452225896],
[144.983921128, -37.845413993]]},
'properties': {'OBJECTID': 1,
'NETID': 1,
'TYPE': 1,
'MCCID': 1389774,
'MCCID_A': 0,
'MCCID_B': 0,
'OTIME': ' ',
'CTIME': ' ',
'COST': 1.85613352795,
'Shape_Length': 123.74223519653556,
'DESCRIPTION': 'Pestrian Footpath',
'TRAFFIC': 'High Traffic'}}
Observing the JSON Full Structure and Properties
By calling the variable 'pedestrianpath' we can observe the full structure, properties and values of the JSON file.
pedestrianpath
# To observe just the geographical longitude and Latitude coordinates we can use:
#pedestrianpath['features'][0]['geometry']
Navigating the JSON File Structure
When you load a JSON file using the json library, you get a dictionary that contains an entry 'features', which holds the list of features. Each feature is in turn a dictionary containing an entry 'geometry'.
The geometry is a dictionary containing the entries 'type' and 'coordinates'.
The JSON file can be traversed or navigated using the following code:
for feature in pedestrianpath['features']:
    print(feature['geometry']['type'])
    print(feature['geometry']['coordinates'])
#Convert the JSON file to a geopandas data frame
gpd_pedestrianpath = geopandas.read_file('Pedestrian_network.json')
gpd_pedestrianpath.head()
| | OBJECTID | NETID | TYPE | MCCID | MCCID_A | MCCID_B | OTIME | CTIME | COST | Shape_Length | DESCRIPTION | TRAFFIC | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1389774 | 0 | 0 | | | 1.856134 | 123.742235 | Pestrian Footpath | High Traffic | LINESTRING (144.98254 -37.84522, 144.98392 -37... |
| 1 | 2 | 2 | 1 | 1389774 | 0 | 0 | | | 0.031922 | 2.128135 | Pestrian Footpath | High Traffic | LINESTRING (144.98042 -37.84496, 144.98041 -37... |
| 2 | 3 | 3 | 1 | 1468181 | 0 | 0 | | | 3.098050 | 206.536685 | Pestrian Footpath | Low Traffic | LINESTRING (144.98041 -37.84494, 144.97973 -37... |
| 3 | 4 | 4 | 1 | 0 | 0 | 0 | | | 1.408645 | 93.909699 | Pestrian Footpath | Low Traffic | LINESTRING (144.98475 -37.84455, 144.98493 -37... |
| 4 | 5 | 5 | 1 | 0 | 0 | 0 | | | 0.515907 | 34.393783 | Pestrian Footpath | Low Traffic | LINESTRING (144.98531 -37.84377, 144.98493 -37... |
#Enhance the efficiency of plotting the dataset by filtering to 'High Traffic' footpath areas.
#Select columns
gpd_pedestrianpath_filtered = gpd_pedestrianpath[['Shape_Length', 'TRAFFIC', 'geometry']]
#Filter
gpd_pedestrianpath_filtered = gpd_pedestrianpath_filtered[gpd_pedestrianpath_filtered['TRAFFIC'].str.contains('High Traffic')]
gpd_pedestrianpath_filtered.head()
| | Shape_Length | TRAFFIC | geometry |
|---|---|---|---|
| 0 | 123.742235 | High Traffic | LINESTRING (144.98254 -37.84522, 144.98392 -37... |
| 1 | 2.128135 | High Traffic | LINESTRING (144.98042 -37.84496, 144.98041 -37... |
| 10 | 0.692685 | High Traffic | LINESTRING (144.98254 -37.84522, 144.98254 -37... |
| 12 | 2.187118 | High Traffic | LINESTRING (144.98391 -37.84540, 144.98392 -37... |
| 16 | 2.682725 | High Traffic | LINESTRING (144.98548 -37.84293, 144.98548 -37... |
Visualising the Melbourne Pedestrian Network on a Map
To visualise the JSON file containing the Melbourne Pedestrian Network, we can use the 'folium', 'json' and 'geopandas' packages with the following code.
# Set the coordinate reference system to WGS84; the {'init': 'epsg:4326'} dict form is deprecated
gpd_pedestrianpath.crs = 'EPSG:4326'
m = folium.Map([-37.81368709240999, 144.95738102347036], zoom_start=12)
folium.Choropleth(
#gpd_pedestrianpath[gpd_pedestrianpath.geometry.length>0.0015], #Optional: select only lines above specified length to plot
gpd_pedestrianpath_filtered,
line_weight=3,
line_color='blue',
control_scale=True,
prefer_canvas=True,
width=800,
height=580
).add_to(m)
m
This section focuses on setting up the Traffic Accident 'Crash-Stats' dataset and preparing it for use in the exploratory data analysis alongside the Melbourne Pedestrian Network dataset.
The raw input dataset contains the following structure:
#Read in the dataset
raw_accidents_pedestrians = pd.read_csv('interactive_dependencies/Accidents_Pedestrians_Melbourne_2008to2020.csv', parse_dates=['DATAccidentDate_accident'])
raw_accidents_pedestrians.info() # see summary information of the data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2028 entries, 0 to 2027
Data columns (total 34 columns):
 #   Column                                  Non-Null Count  Dtype
---  ------                                  --------------  -----
 0   KEYAccidentNumber                       2028 non-null   object
 1   DATAccidentDate_accident                2028 non-null   datetime64[ns]
 2   TIMAccidentTime_accident                2028 non-null   object
 3   CATAccidentTypeDesc_accident            2028 non-null   object
 4   CATDayOfWeek_accident                   2028 non-null   object
 5   CATDCADesc_accident                     2028 non-null   object
 6   CATMelwaysPage_accident                 2028 non-null   object
 7   CATMelwaysGridRef_X_accident            2028 non-null   object
 8   CATMelwaysGridRef_Y_accident            2028 non-null   object
 9   CATLightConditionDesc_accident          2028 non-null   object
 10  NUMVehiclesInvolved_accident            2028 non-null   int64
 11  NUMPersonsInvolved_accident             2028 non-null   int64
 12  NUMPersonsInjured_accident              2028 non-null   int64
 13  KEYPersonID_person                      2028 non-null   int64
 14  CATRoadUserTypeDesc_person              2028 non-null   object
 15  CATTakenHospital_person                 2028 non-null   object
 16  CATInjuryLevelDesc_person               2028 non-null   object
 17  CATAgeGroup_person                      2028 non-null   object
 18  CATPostcode_person                      1454 non-null   float64
 19  CATGender_person                        2028 non-null   object
 20  CATLGAName_node                         2028 non-null   object
 21  CATDEGUrbanName_node                    2028 non-null   object
 22  NUMLatitude_node                        2028 non-null   float64
 23  NUMLongitude_node                       2028 non-null   float64
 24  CATPostcode_node                        2028 non-null   int64
 25  CATSurfaceConditionDesc_surface         2028 non-null   object
 26  CATSubDCACodeDesc_subdca                1982 non-null   object
 27  CATAtmosphericConditionDesc_atmosphere  2028 non-null   object
 28  CATRoadName_acclocation                 2027 non-null   object
 29  CATRoadNameInt_acclocation              2016 non-null   object
 30  CATRoadType_acclocation                 2021 non-null   object
 31  CATRoadTypeInt_acclocation              2010 non-null   object
 32  CATEventTypeDesc_accevent               2028 non-null   object
 33  CATObjectTypeDesc_accevent              2028 non-null   object
dtypes: datetime64[ns](1), float64(3), int64(5), object(25)
memory usage: 538.8+ KB
Setting up the Working Accident 'Crash-Stats' Dataset
The working dataset will have the following structure.
#Create a copy of the raw source dataset
wrk_accident_pedestrians = raw_accidents_pedestrians.copy()
#Create new features from the accident date variable such as a numerical representation of weekday name, week of the year
#day of the year and a separate variable to hold the year of accident.
wrk_accident_pedestrians['NUMDayOfWeek'] = wrk_accident_pedestrians['DATAccidentDate_accident'].dt.strftime('%w')
wrk_accident_pedestrians['NUMWeekOfYear'] = wrk_accident_pedestrians['DATAccidentDate_accident'].dt.strftime('%W')
wrk_accident_pedestrians['NUMDayOfYear'] = wrk_accident_pedestrians['DATAccidentDate_accident'].dt.strftime('%j')
wrk_accident_pedestrians['NUMYearOfAcc'] = wrk_accident_pedestrians['DATAccidentDate_accident'].dt.strftime('%Y')
#Convert the time of accident to a string and clean up excess white space
wrk_accident_pedestrians.TIMAccidentTime_accident = wrk_accident_pedestrians.TIMAccidentTime_accident.astype('string')
wrk_accident_pedestrians.TIMAccidentTime_accident = wrk_accident_pedestrians.TIMAccidentTime_accident.str.rstrip()
#Using the time of accident variable, create new features including accident hour, minute and second
wrk_accident_pedestrians[['hour','minute','second']] = wrk_accident_pedestrians['TIMAccidentTime_accident'].astype(str).str.split(':', expand=True).astype(str)
#Create a new feature to combine the week day name and hour of accident
wrk_accident_pedestrians['CATWeekDayHour'] = wrk_accident_pedestrians[['CATDayOfWeek_accident', 'hour']].agg(' '.join, axis=1)
#Set the time format for the time of accident variable
wrk_accident_pedestrians['TIMAccidentTime_accident'] = pd.to_datetime(wrk_accident_pedestrians['TIMAccidentTime_accident'], format='%H:%M:%S').dt.time
#Clean up the text white space in the DCA description variable
wrk_accident_pedestrians.CATDCADesc_accident = wrk_accident_pedestrians.CATDCADesc_accident.str.rstrip()
#Create and apply a group mapping for the hour of accident
mapping = {'00': 'Early Morning', '01': 'Early Morning', '02': 'Early Morning', '03': 'Early Morning', '04': 'Early Morning', '05': 'Early Morning',
'06': 'Morning', '07': 'Morning', '08': 'Morning', '09': 'Late Morning', '10': 'Late Morning', '11': 'Late Morning',
'12': 'Early Afternoon', '13': 'Early Afternoon', '14':'Early Afternoon', '15': 'Late Afternoon', '16': 'Late Afternoon',
'17': 'Evening', '18': 'Evening', '19': 'Evening', '20': 'Night', '21': 'Night', '22': 'Night', '23': 'Night' }
wrk_accident_pedestrians['hourgroup'] = wrk_accident_pedestrians.hour.map(mapping)
#Create a new feature which concatenates the week day name and accident hour group mapping
wrk_accident_pedestrians['CATWeekDayHourGroup'] = wrk_accident_pedestrians[['CATDayOfWeek_accident', 'hourgroup']].agg(' '.join, axis=1)
#Convert all categorical variables to strings
cat_columns = [
    'CATAccidentTypeDesc_accident', 'CATDayOfWeek_accident', 'CATDCADesc_accident',
    'CATMelwaysPage_accident', 'CATMelwaysGridRef_X_accident', 'CATMelwaysGridRef_Y_accident',
    'CATLightConditionDesc_accident', 'CATRoadUserTypeDesc_person', 'CATTakenHospital_person',
    'CATInjuryLevelDesc_person', 'CATAgeGroup_person', 'CATPostcode_person', 'CATGender_person',
    'CATLGAName_node', 'CATDEGUrbanName_node', 'CATPostcode_node',
    'CATSurfaceConditionDesc_surface', 'CATSubDCACodeDesc_subdca',
    'CATAtmosphericConditionDesc_atmosphere', 'CATRoadName_acclocation',
    'CATRoadNameInt_acclocation', 'CATRoadType_acclocation', 'CATRoadTypeInt_acclocation',
    'CATEventTypeDesc_accevent', 'CATObjectTypeDesc_accevent']
for col in cat_columns:
    wrk_accident_pedestrians[col] = wrk_accident_pedestrians[col].astype('string')
#Create a new feature which concatenates the accident road name and type variables
wrk_accident_pedestrians['CATAccidentRoadGroup'] = wrk_accident_pedestrians['CATRoadName_acclocation'].fillna('') + ' ' + wrk_accident_pedestrians['CATRoadType_acclocation'].fillna('')
#Convert all numerical variables to integer, except for longitude and latitude which will remain as a floating point.
wrk_accident_pedestrians['NUMVehiclesInvolved_accident'] = wrk_accident_pedestrians['NUMVehiclesInvolved_accident'].astype(int)
wrk_accident_pedestrians['NUMPersonsInvolved_accident'] = wrk_accident_pedestrians['NUMPersonsInvolved_accident'].astype(int)
wrk_accident_pedestrians['NUMPersonsInjured_accident'] = wrk_accident_pedestrians['NUMPersonsInjured_accident'].astype(int)
wrk_accident_pedestrians['NUMRecordCount'] = 1
#Print the information summary for the working dataset
wrk_accident_pedestrians.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2028 entries, 0 to 2027
Data columns (total 46 columns):
 #   Column                                  Non-Null Count  Dtype
---  ------                                  --------------  -----
 0   KEYAccidentNumber                       2028 non-null   object
 1   DATAccidentDate_accident                2028 non-null   datetime64[ns]
 2   TIMAccidentTime_accident                2028 non-null   object
 3   CATAccidentTypeDesc_accident            2028 non-null   string
 4   CATDayOfWeek_accident                   2028 non-null   string
 5   CATDCADesc_accident                     2028 non-null   string
 6   CATMelwaysPage_accident                 2028 non-null   string
 7   CATMelwaysGridRef_X_accident            2028 non-null   string
 8   CATMelwaysGridRef_Y_accident            2028 non-null   string
 9   CATLightConditionDesc_accident          2028 non-null   string
 10  NUMVehiclesInvolved_accident            2028 non-null   int32
 11  NUMPersonsInvolved_accident             2028 non-null   int32
 12  NUMPersonsInjured_accident              2028 non-null   int32
 13  KEYPersonID_person                      2028 non-null   int64
 14  CATRoadUserTypeDesc_person              2028 non-null   string
 15  CATTakenHospital_person                 2028 non-null   string
 16  CATInjuryLevelDesc_person               2028 non-null   string
 17  CATAgeGroup_person                      2028 non-null   string
 18  CATPostcode_person                      1454 non-null   string
 19  CATGender_person                        2028 non-null   string
 20  CATLGAName_node                         2028 non-null   string
 21  CATDEGUrbanName_node                    2028 non-null   string
 22  NUMLatitude_node                        2028 non-null   float64
 23  NUMLongitude_node                       2028 non-null   float64
 24  CATPostcode_node                        2028 non-null   string
 25  CATSurfaceConditionDesc_surface         2028 non-null   string
 26  CATSubDCACodeDesc_subdca                1982 non-null   string
 27  CATAtmosphericConditionDesc_atmosphere  2028 non-null   string
 28  CATRoadName_acclocation                 2027 non-null   string
 29  CATRoadNameInt_acclocation              2016 non-null   string
 30  CATRoadType_acclocation                 2021 non-null   string
 31  CATRoadTypeInt_acclocation              2010 non-null   string
 32  CATEventTypeDesc_accevent               2028 non-null   string
 33  CATObjectTypeDesc_accevent              2028 non-null   string
 34  NUMDayOfWeek                            2028 non-null   object
 35  NUMWeekOfYear                           2028 non-null   object
 36  NUMDayOfYear                            2028 non-null   object
 37  NUMYearOfAcc                            2028 non-null   object
 38  hour                                    2028 non-null   object
 39  minute                                  2028 non-null   object
 40  second                                  2028 non-null   object
 41  CATWeekDayHour                          2028 non-null   object
 42  hourgroup                               2028 non-null   object
 43  CATWeekDayHourGroup                     2028 non-null   object
 44  CATAccidentRoadGroup                    2028 non-null   string
 45  NUMRecordCount                          2028 non-null   int64
dtypes: datetime64[ns](1), float64(2), int32(3), int64(2), object(12), string(26)
memory usage: 705.2+ KB
Inspecting the value sets for each variable in the working dataset
Here we broadly check the value set of each variable. This check informs what types of values to expect in each column and highlights which values should be treated as missing or invalid, and how to handle them.
# Function to describe a column with helpful summary statistics
def describe_all_columns(x):
    print('Column summary:')
    # Select the summary output based on the column's data type
    if x.dtypes == 'float64' or x.dtypes == 'int64':
        print(x.describe())
    else:
        # Non-numeric columns: show the summary and list the unique values
        print(x.describe())
        print(x.unique())
for i in wrk_accident_pedestrians.columns:  # for each column in the dataframe
    # Print out summary statistics results
    print('Column %s is of type %s.' % (wrk_accident_pedestrians[i].name, wrk_accident_pedestrians[i].dtypes))
    describe_all_columns(wrk_accident_pedestrians[i])
    print('\n\n')
Inspecting 'Accidents Per Year (All-Time)'
In this section we will explore and observe how many pedestrian accidents have occurred each year.
Important Note: The year 2020 is incomplete, as the last record date in the dataset is March 2020.
We will use 'seaborn' and 'matplotlib' libraries for visualisations.
# Time series plots
# Accidents per year (all time)
#Create a summary dataset to display results
wrk_accident_pedestrians_yeargrp = wrk_accident_pedestrians.groupby('NUMYearOfAcc').agg(NUMAccidentsPerYear=('NUMYearOfAcc', 'count'))
wrk_accident_pedestrians_yeargrp.reset_index(drop=False, inplace=True)
print(wrk_accident_pedestrians_yeargrp)
#Plot the summarised data
sns.set(font_scale=1.5)
sns.set_style("whitegrid")
g = sns.catplot(y="NUMYearOfAcc", hue="CATInjuryLevelDesc_person",
kind="count",
palette="dark",
alpha=.6,
height=8,
aspect=1,
data=wrk_accident_pedestrians.sort_values(by='NUMYearOfAcc'))
g.despine(left=True)
g.set_axis_labels("Count of Accidents", "Year of Accident")
g.legend.set_title("Injury Level")
ax = plt.gca()
ax.set_title("Pedestrian Accidents Per Year (all time)")
| | NUMYearOfAcc | NUMAccidentsPerYear |
|---|---|---|
| 0 | 2008 | 184 |
| 1 | 2009 | 194 |
| 2 | 2010 | 200 |
| 3 | 2011 | 193 |
| 4 | 2012 | 160 |
| 5 | 2013 | 176 |
| 6 | 2014 | 152 |
| 7 | 2015 | 161 |
| 8 | 2016 | 147 |
| 9 | 2017 | 154 |
| 10 | 2018 | 136 |
| 11 | 2019 | 131 |
| 12 | 2020 | 40 |
Inspecting 'Accidents per Weekday (All-Time)'
In this section we will explore and observe how many pedestrian accidents have occurred by weekday.
We will use 'seaborn' and 'matplotlib' libraries for visualisations.
# Time series plots
# Accidents per week-day (all time)
#Create a summary dataset to display results
wrk_accident_pedestrians_weekdaygrp = wrk_accident_pedestrians.loc[:,'CATDayOfWeek_accident'].value_counts().to_frame()
wrk_accident_pedestrians_weekdaygrp.reset_index(drop=False, inplace=True)
print(wrk_accident_pedestrians_weekdaygrp)
#Plot the summarised data
sns.set(font_scale=1.5)
sns.set_style("whitegrid")
g = sns.catplot(x="CATDayOfWeek_accident", y="NUMRecordCount", hue="CATInjuryLevelDesc_person",
kind="bar",
palette="dark",
alpha=.6,
height=8,
aspect=2,
data=wrk_accident_pedestrians.sort_values(by='NUMDayOfWeek'),
estimator=sum)
g.despine(left=True)
g.set_axis_labels("Weekday of Accident", "Count of Accidents",)
g.legend.set_title("Injury Level")
ax = plt.gca()
ax.set_title("Pedestrian Accidents Per Weekday (all time)")
| | index | CATDayOfWeek_accident |
|---|---|---|
| 0 | Friday | 353 |
| 1 | Wednesday | 341 |
| 2 | Thursday | 307 |
| 3 | Saturday | 275 |
| 4 | Tuesday | 269 |
| 5 | Monday | 255 |
| 6 | Sunday | 228 |
Inspecting 'Accidents per Day (All-Days, Detailed)'
In this section we will explore and observe how many pedestrian accidents have occurred each day since the earliest date recorded in the dataset.
We will use 'seaborn' and 'matplotlib' libraries for visualisations.
# Time series plots
# Accidents per day (all-time)
#Create a summary dataset to display results
wrk_accident_pedestrians_daygrp = wrk_accident_pedestrians.loc[:,'DATAccidentDate_accident'].value_counts().to_frame()
wrk_accident_pedestrians_daygrp.reset_index(drop=False, inplace=True)
print(wrk_accident_pedestrians_daygrp)
#Plot the summarised data
sns.set(font_scale=1.5)
sns.set_style("whitegrid")
plt.figure(figsize = (15,8))
g = sns.lineplot(x="DATAccidentDate_accident", y="NUMRecordCount",
palette="dark",
alpha=.6,
data=wrk_accident_pedestrians.sort_values(by='DATAccidentDate_accident'),
estimator=sum)
plt.xticks(rotation=15)
ax = plt.gca()
ax.set_title("Pedestrian Accidents Per Day (all time)")
| | index | DATAccidentDate_accident |
|---|---|---|
| 0 | 2018-06-03 | 7 |
| 1 | 2010-04-25 | 6 |
| 2 | 2010-03-29 | 6 |
| 3 | 2012-03-14 | 4 |
| 4 | 2009-02-26 | 4 |
| ... | ... | ... |
| 1532 | 2018-03-15 | 1 |
| 1533 | 2018-02-24 | 1 |
| 1534 | 2009-10-26 | 1 |
| 1535 | 2017-10-23 | 1 |
| 1536 | 2020-02-11 | 1 |

[1537 rows x 2 columns]
Inspecting 'Accidents per Weekday and Hour grouping (All-Time)'
In this section we will explore and observe how many pedestrian accidents have occurred each weekday and hour grouping.
We will use 'seaborn' and 'matplotlib' libraries for visualisations.
# Time series plots
# Accidents per Weekday and Hour Grouping (all time)
#Create a summary dataset to display results
wrk_accident_pedestrians_weekdayhrgrp = wrk_accident_pedestrians.loc[:,'CATWeekDayHourGroup'].value_counts().to_frame()
wrk_accident_pedestrians_weekdayhrgrp.reset_index(drop=False, inplace=True)
print(wrk_accident_pedestrians_weekdayhrgrp)
#Plot the summarised data
sns.set(font_scale=1.5)
sns.set_style("whitegrid")
g = sns.catplot(x="CATWeekDayHourGroup", y="NUMRecordCount",
kind="bar",
palette="dark",
alpha=.6,
height=8,
aspect=2,
data=wrk_accident_pedestrians.sort_values(by=['NUMDayOfWeek']),
estimator=sum)
g.despine(left=True)
g.set_axis_labels("Weekday Hour of Accident", "Count of Accidents",)
plt.xticks(rotation=90)
ax = plt.gca()
ax.set_title("Pedestrian Accidents Per Weekday and Hour Grouping (all time)")
| | index | CATWeekDayHourGroup |
|---|---|---|
| 0 | Sunday Early Morning | 95 |
| 1 | Wednesday Evening | 77 |
| 2 | Wednesday Early Afternoon | 74 |
| 3 | Friday Evening | 74 |
| 4 | Friday Night | 69 |
| 5 | Thursday Evening | 66 |
| 6 | Saturday Night | 62 |
| 7 | Friday Early Afternoon | 61 |
| 8 | Tuesday Evening | 61 |
| 9 | Wednesday Late Morning | 61 |
| 10 | Thursday Early Afternoon | 59 |
| 11 | Saturday Early Morning | 55 |
| 12 | Friday Late Morning | 55 |
| 13 | Monday Late Morning | 51 |
| 14 | Saturday Evening | 51 |
| 15 | Tuesday Late Morning | 50 |
| 16 | Monday Evening | 49 |
| 17 | Tuesday Early Afternoon | 48 |
| 18 | Thursday Late Morning | 48 |
| 19 | Thursday Morning | 47 |
| 20 | Wednesday Late Afternoon | 46 |
| 21 | Thursday Late Afternoon | 39 |
| 22 | Monday Early Afternoon | 39 |
| 23 | Friday Late Afternoon | 39 |
| 24 | Saturday Early Afternoon | 38 |
| 25 | Monday Morning | 38 |
| 26 | Thursday Night | 36 |
| 27 | Wednesday Morning | 36 |
| 28 | Tuesday Late Afternoon | 36 |
| 29 | Saturday Late Afternoon | 35 |
| 30 | Monday Late Afternoon | 34 |
| 31 | Wednesday Night | 34 |
| 32 | Tuesday Morning | 33 |
| 33 | Sunday Late Morning | 33 |
| 34 | Monday Night | 32 |
| 35 | Friday Morning | 30 |
| 36 | Tuesday Night | 30 |
| 37 | Sunday Late Afternoon | 28 |
| 38 | Friday Early Morning | 25 |
| 39 | Sunday Evening | 24 |
| 40 | Saturday Late Morning | 22 |
| 41 | Sunday Early Afternoon | 20 |
| 42 | Sunday Night | 19 |
| 43 | Wednesday Early Morning | 13 |
| 44 | Saturday Morning | 12 |
| 45 | Thursday Early Morning | 12 |
| 46 | Monday Early Morning | 12 |
| 47 | Tuesday Early Morning | 11 |
| 48 | Sunday Morning | 9 |
Inspecting the 'Geography of Pedestrian Accident Occurrences'
In this section we will explore the frequency of pedestrian accidents by geographical location.
We will use the 'seaborn' and 'matplotlib' libraries for the visualisations.
# Geographical analysis. Streets with most accidents
# Which roads have the most frequent accidents?
#Create a summary dataset to display results
wrk_accident_pedestrians_roadsgrp = wrk_accident_pedestrians.loc[:,'CATAccidentRoadGroup'].value_counts().to_frame()
wrk_accident_pedestrians_roadsgrp.reset_index(drop=False, inplace=True)
wrk_accident_pedestrians_roadsgrp = wrk_accident_pedestrians_roadsgrp.head(50)
print(wrk_accident_pedestrians_roadsgrp)
#Plot the summarised data
sns.set(font_scale=1.5)
sns.set_style("whitegrid")
g = sns.catplot(x="CATAccidentRoadGroup", y="NUMRecordCount", hue="CATInjuryLevelDesc_person",
                kind="bar",
                palette="dark",
                alpha=.6,
                height=8,
                aspect=2,
                order=wrk_accident_pedestrians.CATAccidentRoadGroup.value_counts().iloc[:50].index,
                data=wrk_accident_pedestrians.sort_values(by=['CATAccidentRoadGroup']),
                estimator=sum)
g.despine(left=True)
plt.xticks(rotation=90)
g.set_axis_labels("Road Name", "Count of Accidents")
g.legend.set_title("Injury Level")
ax = plt.gca()
ax.set_title("Pedestrian Accident Location - Road Names (all time)")
      CATAccidentRoadGroup  NUMRecordCount
0         ELIZABETH STREET             126
1          LONSDALE STREET             101
2            ST KILDA ROAD             101
3           COLLINS STREET              95
4           SPENCER STREET              83
5              KING STREET              82
6            BOURKE STREET              73
7          FLINDERS STREET              71
8          RACECOURSE ROAD              51
9          LA TROBE STREET              51
10           FLINDERS LANE              50
11        CLARENDON STREET              50
12         FLEMINGTON ROAD              40
13            ROYAL PARADE              40
14         SWANSTON STREET              36
15       EXHIBITION STREET              33
16               CITY ROAD              32
17         VICTORIA STREET              31
18            ELGIN STREET              29
19            LYGON STREET              26
20               PUNT ROAD              26
21   LITTLE COLLINS STREET              25
22               KINGS WAY              24
23    LITTLE BOURKE STREET              22
24          GRATTAN STREET              21
25              EPSOM ROAD              19
26           ALBERT STREET              18
27             PEEL STREET              17
28           SPRING STREET              17
29             TOORAK ROAD              16
30             HIGH STREET              16
31  LITTLE LONSDALE STREET              16
32           HODDLE STREET              15
33           DUDLEY STREET              14
34        NICHOLSON STREET              14
35       WELLINGTON PARADE              14
36            QUEEN STREET              14
37           CEMETERY ROAD              13
38         COMMERCIAL ROAD              12
39         DRUMMOND STREET              12
40       ABBOTSFORD STREET              11
41          RUSSELL STREET              11
42        ALEXANDRA AVENUE              10
43         FRANKLIN STREET              10
44       HARBOUR ESPLANADE               9
45           BATMAN AVENUE               9
46    QUEENS BRIDGE STREET               9
47         ABECKETT STREET               9
48         DRYBURGH STREET               9
49         WHITEMAN STREET               9
Creating the first map visual to observe where pedestrian accidents are occurring
import folium
from folium.plugins import MarkerCluster
def map_visualization(data):
    #Repeat each accident location NUMRecordCount times so the
    #cluster counts reflect the accident frequency at that point
    locations = []
    for i in range(len(data)):
        row = data.iloc[i]
        locations += [(row.NUMLatitude_node, row.NUMLongitude_node)] * int(row.NUMRecordCount)
    marker_cluster = MarkerCluster(
        locations=locations,
        overlay=True,
        control=True,
    )
    m = folium.Map(location=[-37.81368709240999, 144.95738102347036],
                   tiles="Cartodb Positron", zoom_start=13)
    marker_cluster.add_to(m)
    folium.LayerControl().add_to(m)
    return m
map_visualization(wrk_accident_pedestrians)
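The marker-expansion step inside `map_visualization()` repeats each coordinate once per recorded accident, so that MarkerCluster bubble counts reflect accident frequency. A toy check of that logic (coordinates below are illustrative, not real accident records):

```python
# Each (lat, lon) pair is repeated 'count' times, mirroring the
# NUMRecordCount expansion in map_visualization().
rows = [
    (-37.81, 144.95, 3),  # 3 accidents recorded at this node
    (-37.82, 144.96, 1),  # 1 accident recorded at this node
]
locations = []
for lat, lon, count in rows:
    locations += [(lat, lon)] * count
print(len(locations))  # 4 marker points feed the cluster: 3 + 1
```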
Creating an Alternative Map Visual to Distinguish Accidents by Year and Injury Type
To observe the Melbourne Pedestrian Route JSON map overlaid with the Pedestrian Accident data we can use the following code.
Both the Pedestrian Routes and the Accident records will be filtered first, to improve the efficiency and performance of the geographical visualisations:
The Pedestrian Routes will be filtered to only 'High Traffic' footpaths. The Accident records will be filtered to only those pedestrians who sustained serious or fatal injuries.
#Filter the accident dataset to persons who sustained serious or fatal injuries.
wrk_accident_pedestrians_filtered = wrk_accident_pedestrians[
    wrk_accident_pedestrians['CATInjuryLevelDesc_person'].isin(['Serious injury', 'Fatality'])]
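The `isin` boolean-mask filter can be sanity-checked on a toy frame (the column and label names follow the notebook; the rows are illustrative only):

```python
import pandas as pd

# Toy frame standing in for the accident dataset.
df = pd.DataFrame({
    "KEYAccidentNumber": ["T001", "T002", "T003", "T004"],
    "CATInjuryLevelDesc_person": ["Other injury", "Serious injury",
                                  "Fatality", "Not injured"],
})

# Keep only the rows whose injury level is serious or fatal.
mask = df["CATInjuryLevelDesc_person"].isin(["Serious injury", "Fatality"])
filtered = df[mask]
print(filtered["KEYAccidentNumber"].tolist())  # ['T002', 'T003']
```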
# Function to change the marker color
# according to the injury level
def color(CATInjuryLevelDesc_person):
    if CATInjuryLevelDesc_person == "Not injured":
        col = 'green'
    elif CATInjuryLevelDesc_person == "Other injury":
        col = 'blue'
    elif CATInjuryLevelDesc_person == "Serious injury":
        col = 'orange'
    elif CATInjuryLevelDesc_person == "Fatality":
        col = 'red'
    else:
        col = 'black'
    return col
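The same mapping can also be expressed as a dictionary lookup, which keeps the label-to-colour table in one place (a sketch; the labels match those used in the function above, and `color_lookup` is a hypothetical alternative name):

```python
# Injury-level to marker-colour table, with black as the fallback.
INJURY_COLOURS = {
    "Not injured": "green",
    "Other injury": "blue",
    "Serious injury": "orange",
    "Fatality": "red",
}

def color_lookup(injury_level):
    # Unknown labels fall through to black, matching the if/elif chain.
    return INJURY_COLOURS.get(injury_level, "black")

print(color_lookup("Fatality"), color_lookup("unknown"))  # red black
```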
# Creating a map object using Map() function.
# Location parameter takes latitudes and
# longitudes as starting location.
# (Map will be centered at those co-ordinates)
m = folium.Map(location=[-37.81368709240999, 144.95738102347036],
               zoom_start=12,
               tiles="cartodbpositron",
               control_scale=True,
               prefer_canvas=True,
               width=800,
               height=580)
#Create a feature group (map layer) for each accident year
year_layers = {year: folium.FeatureGroup(name='Accident Year %d' % year)
               for year in range(2008, 2021)}

#Loop through each row of the working dataset and add a marker
#to the layer for its accident year so the display of each year
#can be toggled independently
for i, v in wrk_accident_pedestrians_filtered.iterrows():
    accyear = int(v['NUMYearOfAcc'])
    popup = """
    Accident ID : <b>%s</b><br>
    Year : <b>%s</b><br>
    Injury : <b>%s</b><br>
    Long : <b>%s</b><br>
    Lat: <b>%s</b><br>
    """ % (v['KEYAccidentNumber'], v['NUMYearOfAcc'],
           v['CATInjuryLevelDesc_person'], v['NUMLongitude_node'],
           v['NUMLatitude_node'])
    if accyear in year_layers:
        folium.Marker(location=[v['NUMLatitude_node'], v['NUMLongitude_node']],
                      tooltip=popup,
                      icon=folium.Icon(color=color(v['CATInjuryLevelDesc_person']),
                                       icon_color='yellow', icon='male',
                                       prefix='fa')).add_to(year_layers[accyear])

#Add the year layers to the base map
for layer in year_layers.values():
    layer.add_to(m)
#Add in the pedestrian footpaths as a layer
#(Note: options such as control_scale, prefer_canvas, width and height
# belong to folium.Map, not Choropleth, so they are not passed here)
folium.Choropleth(
    #gpd_pedestrianpath[gpd_pedestrianpath.geometry.length>0.0015], #Optional: select only lines above specified length to plot
    gpd_pedestrianpath_filtered,
    name="Pedestrian Paths",
    line_weight=3,
    line_color='blue'
).add_to(m)
#Add the map control
folium.LayerControl(collapsed = False).add_to(m)
#Show the map
m
#Save the map
#m.save('geoJSON_pedestrianaccidents_map.html')
This analysis has provided a comprehensive starting point for inspecting the Melbourne Open Data Pedestrian Network dataset and Traffic Accidents (Pedestrians) data.
We achieved in this analysis:
We learned from this analysis:
As a preliminary view, we observed that the majority of pedestrian accidents occurred on 'High-Traffic' pedestrian network routes
At a broad level:
The total number of pedestrian accidents in which pedestrians were seriously or fatally injured declined between 2017 and 2019 (excluding 2020, which contained only three months of data): from more than 60 pedestrians in 2017 to fewer than 30 in 2019. This appears to be a positive and optimistic trend.
Overall, Wednesdays and Fridays have the highest numbers of seriously and fatally injured pedestrians. Separately, Wednesday afternoon and evening, Friday evening and night, and Sunday early morning show the highest numbers of accidents involving pedestrians.
The top three roads with the highest numbers of seriously injured pedestrians are St Kilda Road, Elizabeth Street and Flinders Street.
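The road ranking above comes from counting serious-injury records per road; that pattern can be sketched on a toy frame (column and label names follow the notebook; the rows are illustrative, not the real counts):

```python
import pandas as pd

# Toy stand-in for the working dataset.
df = pd.DataFrame({
    "CATAccidentRoadGroup": ["ST KILDA ROAD", "ST KILDA ROAD", "ELIZABETH STREET",
                             "FLINDERS STREET", "ELIZABETH STREET", "KING STREET"],
    "CATInjuryLevelDesc_person": ["Serious injury", "Serious injury", "Serious injury",
                                  "Serious injury", "Fatality", "Other injury"],
})

# Count serious injuries per road and take the top three.
serious = df[df["CATInjuryLevelDesc_person"] == "Serious injury"]
top3 = serious["CATAccidentRoadGroup"].value_counts().head(3)
print(top3)
```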
Observations for further opportunities
[1] Thompson Road North Geelong Road Safety Improvements https://regionalroads.vic.gov.au/map/barwon-south-west-improvements/thompson-road-safety-improvements
[2] Victorian 'Crash Stats' dataset https://discover.data.vic.gov.au/dataset/crash-stats-data-extract/resource/392b88c0-f010-491f-ac92-531c293de2e9
[3] Pedestrian Routes Dataset https://data.melbourne.vic.gov.au/Transport/Pedestrian-Network/4id4-tydi
Technical References
[4] Accessing geoJSON data https://stackoverflow.com/questions/48263802/finding-location-using-geojson-file-using-python
[5] Accessing geoJSON data https://medium.com/analytics-vidhya/measure-driving-distance-time-and-plot-routes-between-two-geographical-locations-using-python-39995dfea7e
[6] Visualising a geoJSON dataset https://python-visualization.github.io/folium/quickstart.html#GeoJSON/TopoJSON-Overlays
[7] Visualising categorised data on a map https://www.geeksforgeeks.org/python-adding-markers-to-volcano-locations-using-folium-package/
[8] Creating point plot group layers with folium https://towardsdatascience.com/creating-an-interactive-map-of-wildfire-data-using-folium-in-python-7d6373b6334a
[9] Ideas for further opportunities - Time Series Analysis https://geohackweek.github.io/ghw2018_web_portal_inlandwater_co2/InteractiveTimeSeries.html
!jupyter nbconvert --to html usecase-pedestriansafety-part1.ipynb